Bootstrapped Authorship Attribution in Compression Space - Notebook for PAN at CLEF 2012

Authors

Ramon de Graaff

Cor J. Veenman

Abstract

From a machine learning standpoint, the PAN 2012 Lab contest posed one major challenge: in all authorship attribution tasks, the number of training documents was extremely low. We extended our previous work, in which compression distances to randomly selected prototype documents from the training corpus were used as the feature representation, and a supervised multi-class classifier was learned in the resulting feature space using the remaining documents. Inspired by bootstrap resampling, we now drew document samples from the few source documents in order to obtain sufficient prototypes and samples to learn a supervised classifier. Using internal validation, we tuned the size of the document samples, the compression method, the distance measure, the classification method, and the decision threshold (for the open-class tasks) for optimal F1 score. We entered this scheme in the closed-class and open-class author identification tasks. In the overall results for these tasks we shared fourth place, based on the reported average recall of the 11 teams.
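The representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract states that the compression method, distance measure, and sample size were tuned, but does not say which were chosen. The sketch assumes zlib as the compressor, the normalized compression distance (NCD) as the distance measure, and fixed-length character windows at random offsets as the document samples. The function names (`compressed_size`, `ncd`, `draw_samples`, `feature_vector`) are hypothetical.

```python
import random
import zlib
from typing import List


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed text (zlib is an assumed compressor)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance, one possible compression-based distance."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def draw_samples(document: str, n_samples: int, sample_size: int,
                 rng: random.Random) -> List[str]:
    """Draw character windows at random offsets, with replacement.

    This is an assumed sampling scheme, used here only to show how a few
    source documents can yield many prototypes and training samples.
    """
    max_start = max(len(document) - sample_size, 0)
    starts = [rng.randint(0, max_start) for _ in range(n_samples)]
    return [document[s:s + sample_size] for s in starts]


def feature_vector(sample: str, prototypes: List[str]) -> List[float]:
    """Represent a sample by its distances to all prototype samples."""
    return [ncd(sample, p) for p in prototypes]
```

Under these assumptions, some samples drawn per author would serve as prototypes and the rest would be mapped through `feature_vector` to train any supervised multi-class classifier. For the open-class tasks, a threshold on the classifier's confidence would decide whether to reject all known authors.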


Similar resources

Bootstrapped Authorship Attribution in Compression Space


Authorship Attribution using Compression Distances

Authorship attribution has been a field of interest for researchers in the past, especially for forensic purposes. In this thesis, to obtain the degree of Bachelor of Science from the Leiden University, we investigate character n-grams and so-called compression distances to prototypes on several datasets, i.e., the datasets provided by PAN Labs (a benchmarking activity on uncovering plagiarism,...


Authorship Identification in Large Email Collections: Experiments Using Features that Belong to Different Linguistic Levels - Notebook for PAN at CLEF 2011

The aim of this paper is to explore the usefulness of features from different linguistic levels for email authorship identification. Using various email datasets provided by the PAN’11 lab, we tested several feature groups in both authorship attribution and authorship verification subtasks. The selected feature groups combined with Regularized Logistic Regression and One-Class SVM machine learni...


EPSMS and the Document Occurrence Representation for Authorship Identification - Notebook for PAN at CLEF 2011

This paper describes the participation of the PISIS team in the authorship identification track of PAN’11. We adopted two different strategies for the tasks of authorship attribution and authorship verification. For authorship attribution we performed experiments with a document occurrence representation using a standard classification-based approach. Results obtained with this approach were mi...


Vote/Veto Meta-Classifier for Authorship Identification - Notebook for PAN at CLEF 2011

For the PAN 2011 authorship identification challenge we have developed a system based on a meta-classifier which selectively uses the results of multiple base classifiers. In addition we also performed feature engineering based on the given domain of e-mails. We present our system as well as results on the evaluation dataset. Our system performed second and third best in the authorship attribut...



Publication date: 2012
